In [3]:
from __future__ import division  # must come before other statements: true division on Python 2

import glob
import os

import numpy as np
import pandas as pd
import sqlalchemy as sqla
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model
Data comes from two sources: trip records from the NYC Taxi & Limousine Commission (TLC), from which I compute per-tract dropoff counts, and demographic data from the US Census Bureau (the 2010 census and the 2007-2011 American Community Survey).
My goal will be to predict the number of taxi dropoffs in an area based on demographic and other stats about the area. Ideally, I would like to understand the demographics of taxicab users and how well taxicabs serve NYC residents. I may later expand my data to look at green "boro cabs" (which are only allowed to pick up passengers outside of the Manhattan central business district; data is also available from the TLC) and Uber cars (https://github.com/fivethirtyeight/uber-tlc-foil-response) to compare their usage demographics.
If I do something with the Uber data, I plan to write up a story in the style of fivethirtyeight.com that I may submit to them, as they have done a series of stories on this Uber data.
I would also expect that a paper comparing taxi services would be useful for the city government, which wants to ensure that there is good coverage throughout the city and across demographic groups.
I'm only using census tracts with a population of at least 1000. Most of the tracts with population below that are parks or ocean, and the errors associated with data on these are generally very high. Also, taxi dropoffs are probably not representative of the resident population for those areas.
In [4]:
engine = sqla.create_engine('postgresql://mydatabase:notyours@localhost:5432/TaxiData', echo=False)
# Keep only census tracts with population >= 1000 (see above)
full = pd.read_sql_query('SELECT a.* FROM lotsofdata AS a WHERE a.totalpopulation>=1000',
                         engine).set_index('fipscodes')
For this analysis, I'll be looking at data from three columns of the ACS, with per-capita dropoffs as my dependent variable, i.e. the dropoffs computed above divided by the population given by the census.
This uses the 2010 census (the most recent) and the 2007-2011 5-year American Community Survey (ACS) data. I figure the ACS data splits the difference between my 2010 population data and the 2013 drop-off data.
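In symbols, the dependent variable for each tract is
$$\text{dropoffs per capita} = \frac{\text{2013 yellow cab dropoffs in the tract}}{\text{2010 census population of the tract}}.$$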
The census uses some funky abbreviations for their column headers, and I haven't changed them. Here's a description: MOGE001 is the total number of commuters (workers reporting a means of transportation to work), MOGE011 is the number who commute by driving, and MRUE001 is per-capita income.
I compute the ratio of non-driving commuters to total commuters, labeled nondrivercommuterrat
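That is, matching the computation in the next cell:
$$\text{nondrivercommuterrat} = \frac{\text{MOGE001} - \text{MOGE011}}{\text{MOGE001}}$$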
In [5]:
subset = full[['MOGE001', 'MOGE011', 'MRUE001', 'abridged2013ycdrpoffpc']].copy()
subset['nondrivercommuterrat'] = (subset['MOGE001'] - subset['MOGE011']) / subset['MOGE001']
subset.replace(np.inf, np.nan, inplace=True)  # tracts reporting zero commuters
subset.dropna(inplace=True)
subset.head()
Out[5]:
In [7]:
# Plotting the data:
# (Optional usetex/siunitx/helvet/sansmath styling via plt.rc is omitted here;
# those settings are global and better added to an rc file, which makes them permanent.)
matplotlib.rc('legend', fontsize=10)
h = 6                      # figure height in inches
gr = (1 + np.sqrt(5)) / 2  # golden ratio for the aspect
plt.figure(figsize=[h * gr, h])
plt.autoscale(tight=True)
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
sc = plt.scatter(subset['MRUE001'], subset['abridged2013ycdrpoffpc'], alpha=0.5,
                 label='Data', c=subset['nondrivercommuterrat'],
                 cmap=plt.cm.get_cmap('viridis'))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.show()
There appears to be a bit of a linear relationship between the logs of the dropoffs per-capita and per-capita income, but it isn't that strong. There is a funky region that almost looks like it has a negative slope, but the coloring shows that this region tends to have fewer non-driver commuters.
In [28]:
comd = 0.8
asub = subset[subset['nondrivercommuterrat'] > comd]
bsub = subset[subset['nondrivercommuterrat'] < comd]
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('cool'))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
If we look only at the data with a high non-driver commuter ratio, the linear relationship is very clear. That funky corner consists entirely of tracts with lower non-driver commuter ratios.
As for the rest of the data, the relationship isn't obvious. At least using just per-capita income as the independent variable, there doesn't seem to be much correlation (see the blob below), and I will need to explore other variables. I might want to use some more advanced machine learning techniques to figure out which variables are useful.
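One direction for that (a sketch only, not something I've run; it reuses the linear_model import from the top, and the column list is a stand-in for whatever extra ACS columns I pull in later): an L1-penalized fit zeroes out uninformative columns, which is a cheap way to shortlist variables.
In [ ]:
# Sketch: cross-validated lasso to shortlist predictors.
# X_cols is hypothetical -- swap in additional ACS columns as they are added.
X_cols = ['MRUE001', 'nondrivercommuterrat']
X = subset[X_cols]
X = (X - X.mean()) / X.std()  # standardize so coefficients are comparable
# assumes strictly positive dropoff counts, as in the fits below
y = np.log(subset['abridged2013ycdrpoffpc'])
lasso = linear_model.LassoCV(cv=5).fit(X, y)
print(dict(zip(X_cols, lasso.coef_)))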
In [9]:
comd = 0.8
asub = subset[subset['nondrivercommuterrat'] < comd]
bsub = subset[subset['nondrivercommuterrat'] > comd]
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $<$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $>$ 0.8')
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
We can, however, run a regression on the high non-driver commuter data:
In [31]:
asub = subset[subset['nondrivercommuterrat'] > comd]
bsub = subset[subset['nondrivercommuterrat'] < comd]
# OLS in log-log space: log(dropoffs per-capita) ~ log(per-capita income)
model = sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(asub['MRUE001']), prepend=False))
fit = model.fit()
fit.summary()
Out[31]:
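Since the regression is in log-log space, $\log y = \beta \log x + c$ (with $y$ the per-capita dropoffs and $x$ per-capita income), the fitted line back in the original units is the power law
$$y = e^{c}\,x^{\beta},$$
with $\beta$ = fit.params['MRUE001'] and $c$ = fit.params['const']. That's the expression plotted in red below.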
In [30]:
comd = 0.8
asub = subset[subset['nondrivercommuterrat'] > comd]
bsub = subset[subset['nondrivercommuterrat'] < comd]
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
# Back-transform of the log-log fit: y = exp(c) * x**beta
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
In [32]:
fig = sm.qqplot(fit.resid, line='45')
plt.show()
In [135]:
hist=fit.resid.hist(alpha=0.8)
I have to admit, I'm not quite sure what to think of these plots. Clearly my data isn't exactly normally distributed, but this is the real world so that's not surprising. Does it deviate from a normal distribution enough that I should be worried? I don't know.
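One way to put a number on that worry: the residuals' skew and kurtosis feed two standard normality tests (both are also reported at the bottom of fit.summary() above). A minimal sketch:
In [ ]:
from scipy import stats
from statsmodels.stats.stattools import jarque_bera

# D'Agostino-Pearson omnibus test: are skew and kurtosis jointly normal-like?
k2, omni_p = stats.normaltest(fit.resid)
# Jarque-Bera: joint test that skew is 0 and kurtosis is 3
jb, jb_p, skew, kurt = jarque_bera(fit.resid)
print('omnibus p = %.3g, JB p = %.3g, skew = %.2f, kurtosis = %.2f'
      % (omni_p, jb_p, skew, kurt))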
I'm going to take a look at the outliers and high leverage points regardless:
In [14]:
from statsmodels.stats import outliers_influence

influence = outliers_influence.OLSInfluence(fit)
infframe = influence.summary_frame()
# Flag points whose studentized residual exceeds 3 in absolute value
outliers = infframe[np.abs(infframe['student_resid']) > 3].index
In [45]:
asub.loc[outliers]
Out[45]:
In [46]:
outliers.tolist()
Out[46]:
In [49]:
from IPython.display import Image
Image(filename='Outliers.png')
Out[49]:
These five outliers consist of:
In [52]:
asub.drop(outliers, inplace=True)  # refit without the five outliers
model = sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(asub['MRUE001']), prepend=False))
fit = model.fit()
fit.summary()
Out[52]:
In [53]:
# asub (now with outliers dropped) and bsub carry over from above.
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
Removing the outliers bumps up the $R^2$ and, more importantly, produces a more representative fit line.
Next I'll look at high leverage points:
In [63]:
# Flag points with leverage more than three times the average hat-matrix diagonal
highleverage = infframe[infframe['hat_diag'] > (infframe['hat_diag'].mean() * 3)].index
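For context on the cutoff: the hat-matrix diagonals of an OLS fit sum to the number of fitted parameters, so with $p$ parameters and $n$ points the average leverage is $p/n$ (here $p = 2$: slope and intercept). The filter above flags tracts with more than three times that average leverage, a common rule of thumb.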
In [77]:
Image(filename='HighL_Outliers.png')
Out[77]:
These are some of the highest-income census tracts in the US (world?), possibly the highest; I feel like I read that somewhere.
Dropping these high leverage points as well as the outliers gives us:
In [64]:
asub.drop(highleverage, inplace=True)  # also drop the high leverage points
model = sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(asub['MRUE001']), prepend=False))
fit = model.fit()
fit.summary()
Out[64]:
The $R^2$ is slightly lower, but the parameters are probably more representative of the data as a whole.
In [65]:
# asub (outliers and high leverage points dropped) and bsub carry over from above.
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
Below is all the data, so that you can see the dropped outliers and high leverage points as well. Only ten of the 682 data points are dropped (about 1.5%), so we are dropping a very small fraction of the data.
In [76]:
# Same plot as above, with the dropped points overlaid in red and orange.
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
plt.scatter(subset.loc[outliers]['MRUE001'], subset.loc[outliers]['abridged2013ycdrpoffpc'],
            alpha=0.25, label='Outliers', color='red')
plt.scatter(subset.loc[highleverage]['MRUE001'], subset.loc[highleverage]['abridged2013ycdrpoffpc'],
            alpha=0.3, label='High leverage', color='orange')
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='lower right')
plt.show()
Now we can see that the remaining data is much closer to normally distributed:
In [71]:
fig = sm.qqplot(fit.resid, line='45')
plt.show()
In [72]:
hist=fit.resid.hist(alpha=0.8)
We've got a fit that explains most of the variance, with an $R^2$ of 0.7. I suspect that I won't be able to further explain this subset (non-driver commuter ratio > 0.8) of the data, but maybe I will.
However, there is still lots to do. I should probably do an error analysis on this data, to see whether there is really any more variance that could be explained (i.e. much of the remaining variance might be due to random fluctuations). I also might want to do some analysis to better understand the causal relationship between these two variables. Many of these dropoffs are not representative of the resident population but are instead common destinations, e.g. Grand Central and Penn Stations, or the area around Central Park. In that case, the common destination causes high rents and thus high incomes, so it can't be concluded that the rich are more likely to take cabs.
Also, to really do this analysis properly, I shouldn't be fitting on all the data and should do some sampling, fitting on one subset and testing on the rest. I figure I can get to that once I've got a model that explains the rest of the data as well.
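As a first pass at that, a minimal holdout sketch (train_test_split lives in sklearn.model_selection on recent scikit-learn, sklearn.cross_validation on older versions):
In [ ]:
from sklearn.model_selection import train_test_split

# Fit on 70% of the high non-driver-commuter tracts, evaluate on the rest
train, test = train_test_split(asub, test_size=0.3, random_state=0)
model = sm.OLS(np.log(train['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(train['MRUE001']), prepend=False))
holdout_fit = model.fit()
pred = holdout_fit.predict(sm.add_constant(np.log(test['MRUE001']), prepend=False))
resid = np.log(test['abridged2013ycdrpoffpc']) - pred
print('held-out residual std: %.3f' % resid.std())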
In [ ]: